Monitoring, Observability & Infrastructure Interview Guide

Prometheus (25 Questions)

Core Concepts

  1. What is Prometheus? Key features
  2. What is the Prometheus architecture? Components (Prometheus Server, Pushgateway, Alertmanager, Exporters)
  3. What is a time-series database? How does Prometheus store data?
  4. What is the difference between monitoring and observability?
  5. What are the four golden signals of monitoring? (Latency, Traffic, Errors, Saturation)

Metrics & Data Model

  1. What are metrics in Prometheus? Types of metrics:
    • Counter
    • Gauge
    • Histogram
    • Summary
  2. What is a metric label? How to use labels effectively?
  3. What is cardinality? Why is high cardinality problematic?
  4. What is the difference between histogram and summary?
  5. When to use Counter vs Gauge?
  6. What is metric naming convention in Prometheus?
  7. What is the data retention period in Prometheus?
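
When answering the metric-type questions above, it can help to picture what a scrape actually returns. The following is a minimal sketch of the Prometheus text exposition format; `render_metric` and the metric names are hypothetical helpers for illustration, not part of any Prometheus client library.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. Hypothetical helper
    for illustration; real services would use a client library instead.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Labels are rendered as key="value" pairs inside braces.
            body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines)


# A counter only goes up; a gauge reports a current value.
counter_text = render_metric(
    "http_requests_total", "counter", "Total HTTP requests.",
    [({"method": "GET", "status": "200"}, 1027)],
)
gauge_text = render_metric(
    "queue_depth", "gauge", "Jobs currently waiting.", [({}, 7)]
)
```

Each distinct label combination becomes its own time series, which is why label choice drives cardinality.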

PromQL (Prometheus Query Language)

  1. What is PromQL? Basic query syntax
  2. What is an instant vector vs range vector?
  3. Common PromQL functions:
    • rate() - Calculate per-second rate
    • irate() - Instant rate
    • increase() - Total increase
    • sum(), avg(), max(), min()
    • by and without aggregations
  4. What is the difference between rate() and irate()?
  5. How to calculate percentiles in PromQL? (histogram_quantile)
  6. How to filter metrics by labels in PromQL?
  7. What is the up metric? How to use it for health checks?
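
For the rate() questions, a rough sketch of the idea behind the function may be useful. This is a simplification under stated assumptions: it handles counter resets but skips the extrapolation to window boundaries that Prometheus applies, so results will not match PromQL exactly. `simple_rate` is a hypothetical name.

```python
def simple_rate(samples):
    """Approximate PromQL rate() over a list of (timestamp_seconds, value).

    Simplification: divides by the time between first and last sample and
    skips the window-boundary extrapolation that Prometheus performs.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples in the range
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. process restart), so the
        # new value counts as the increase since the reset.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Counter scraped every 15 s, with a reset between the 2nd and 3rd sample.
window = [(0, 100), (15, 130), (30, 10), (45, 40)]
per_second = simple_rate(window)
```

irate() differs in that it looks only at the last two samples in the range, which makes it more responsive but noisier.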

Integration & Exporters

  1. What is an exporter in Prometheus?
  2. Common exporters:
    • Node Exporter (system metrics)
    • JMX Exporter (Java applications)
    • Blackbox Exporter (endpoint monitoring)
    • Custom exporters
  3. How to integrate Prometheus with Spring Boot? (Micrometer, Actuator)
  4. What is Pushgateway? When to use it?
  5. What is service discovery in Prometheus? (static config, DNS, Kubernetes, Consul)

Alerting

  1. What is Alertmanager? How does it work?
  2. How to configure alerts in Prometheus?
  3. What is alert routing and grouping?
  4. What is silencing in Alertmanager?
  5. What are best practices for alert thresholds?

Grafana (20 Questions)

Core Concepts

  1. What is Grafana? Key features
  2. What is the difference between Prometheus and Grafana?
  3. What are Grafana data sources? (Prometheus, InfluxDB, Elasticsearch, CloudWatch, MySQL)
  4. What is a dashboard in Grafana?
  5. What is a panel? Types of panels (Graph, Gauge, Table, Heatmap, Stat)

Dashboards & Visualization

  1. How to create a dashboard in Grafana?
  2. What are dashboard variables? Use cases
  3. What is templating in Grafana?
  4. How to create dynamic dashboards?
  5. What are dashboard annotations?
  6. What is the difference between absolute time and relative time ranges?
  7. How to share dashboards? (export JSON, snapshots, public dashboards)

Alerting

  1. How to configure alerts in Grafana?
  2. What are notification channels? (Email, Slack, PagerDuty, Webhook)
  3. What is alert state? (Pending, Alerting, OK, No Data)
  4. What is the difference between Grafana alerts and Prometheus alerts?

Advanced Features

  1. What is Grafana Loki? How is it different from Elasticsearch?
  2. What is Grafana Tempo? (distributed tracing)
  3. What are Grafana plugins?
  4. How to use Grafana with Kubernetes?
  5. What are best practices for dashboard design?

ELK Stack (Elasticsearch, Logstash, Kibana) (30 Questions)

Elasticsearch

  1. What is Elasticsearch? Core concepts
  2. What is an index in Elasticsearch?
  3. What is a document and document ID?
  4. What is a shard? Primary shard vs replica shard
  5. Why is sharding important in Elasticsearch?
  6. What is an inverted index?
  7. What is a mapping in Elasticsearch?
  8. What are analyzers? Common analyzers (Standard, Whitespace, Keyword, Pattern)
  9. What is the difference between text and keyword data types?
  10. How does Elasticsearch achieve near real-time search?
  11. What is a cluster, node, and index in Elasticsearch?
  12. What is the difference between GET and SEARCH API?
  13. What are query types in Elasticsearch?
    • Match query
    • Term query
    • Range query
    • Bool query
    • Wildcard query
  14. What is aggregation in Elasticsearch? (Bucket, Metric, Pipeline)
  15. What is the difference between term query and match query?
  16. How to perform full-text search in Elasticsearch?
  17. What is scoring and relevance in Elasticsearch?
  18. How to optimize Elasticsearch performance?
  19. What is circuit breaker in Elasticsearch?
  20. What is index lifecycle management (ILM)?
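
The inverted index question above is easier to answer with a toy model in mind. This is a minimal sketch, assuming whitespace tokenization and lowercasing only; Elasticsearch analyzers do considerably more, and the function names here are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased whitespace token to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return IDs of documents that contain every term in the query."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "connection timeout while calling payment service",
    2: "payment accepted",
    3: "timeout reading from cache",
}
index = build_inverted_index(docs)
matches = search_all_terms(index, "payment timeout")
```

Lookup cost depends on the number of query terms and posting-list sizes rather than on scanning every document, which is the point of the structure.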

Logstash

  1. What is Logstash? Architecture
  2. What are the three stages of Logstash pipeline? (Input, Filter, Output)
  3. Common Logstash input plugins (file, beats, kafka, jdbc)
  4. Common Logstash filter plugins (grok, mutate, date, json, geoip)
  5. Common Logstash output plugins (elasticsearch, file, kafka, stdout)
  6. What is Grok pattern? How to use it?
  7. What is the difference between Logstash and Filebeat?
  8. How to handle log parsing errors in Logstash?
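
For the Grok and parsing-error questions, the sketch below shows the underlying idea in plain Python: a Grok expression is essentially a set of named regular-expression groups. The log line and `parse_line` helper are assumptions for illustration; the `_grokparsefailure` tag mirrors what the Logstash grok filter adds by default when a pattern does not match.

```python
import re

# Roughly what a Grok expression such as
# "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}"
# boils down to: named groups over ordinary regular expressions.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\s+"
    r"(?P<message>.*)"
)


def parse_line(line):
    """Return extracted fields, or a tagged event when the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        # Keep the raw line and tag it so failures can be routed or counted.
        return {"message": line, "tags": ["_grokparsefailure"]}
    return match.groupdict()


parsed = parse_line("2024-05-01T10:15:30 ERROR Payment gateway timed out")
failed = parse_line("not a log line")
```

Routing tagged events to a separate index or file is one common way to handle parsing errors without losing data.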

Kibana

  1. What is Kibana? Key features
  2. What is Discover in Kibana?
  3. How to create visualizations in Kibana? (Bar, Line, Pie, Heatmap, Data Table)
  4. What is a Kibana dashboard?
  5. What are index patterns in Kibana?
  6. What is Kibana Query Language (KQL)?
  7. What is Kibana Lens?
  8. How to create alerts in Kibana?
  9. What is Canvas in Kibana?

ELK Stack Integration & Best Practices

  1. What is the typical ELK stack workflow?
  2. What are Beats? (Filebeat, Metricbeat, Packetbeat, Heartbeat, Auditbeat)
  3. When to use Logstash vs Filebeat?
  4. How to secure ELK stack? (authentication, encryption, role-based access)
  5. What are best practices for log management?
  6. How to handle large-scale log ingestion?
  7. What is hot-warm-cold architecture in Elasticsearch?

Apache Kafka (35 Questions)

Core Concepts

  1. What is Apache Kafka? Use cases
  2. What is the Kafka architecture? Components:
    • Broker
    • Topic
    • Partition
    • Producer
    • Consumer
    • ZooKeeper/KRaft
  3. What is a topic in Kafka?
  4. What is a partition? Why is partitioning important?
  5. What is a broker in Kafka?
  6. What is a Kafka cluster?
  7. What is the role of ZooKeeper in Kafka?
  8. What is KRaft mode? How does it differ from ZooKeeper?
  9. What is a message/record in Kafka? (Key, Value, Timestamp, Headers)

Producers

  1. What is a Kafka producer?
  2. How does a producer send messages to Kafka?
  3. What is producer acknowledgment (acks)? (0, 1, all/-1)
  4. What is idempotent producer?
  5. What is the difference between sync and async send?
  6. What is partitioner? How does producer choose partition? (key-based, round-robin, custom)
  7. What is producer batching?
  8. What are producer configuration parameters?
  • batch.size
  • linger.ms
  • compression.type
  • max.in.flight.requests.per.connection
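
For the partitioner question, here is a rough sketch of key-based versus keyless partition selection. It is an illustration only: `zlib.crc32` stands in for the murmur2 hash that Kafka's default partitioner uses, so actual partition numbers will differ, and newer clients use a sticky strategy for keyless records rather than strict round-robin. `choose_partition` is a hypothetical name.

```python
import zlib
from itertools import count

_round_robin = count()


def choose_partition(key, num_partitions):
    """Pick a partition: hash the key if present, else rotate.

    zlib.crc32 is a stand-in here; Kafka's default partitioner hashes keys
    with murmur2, so real partition numbers will differ.
    """
    if key is not None:
        # Same key -> same partition, which preserves per-key ordering.
        return zlib.crc32(key.encode("utf-8")) % num_partitions
    return next(_round_robin) % num_partitions
```

Because the result depends on the partition count, adding partitions to a topic changes where existing keys land.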

Consumers

  1. What is a Kafka consumer?
  2. What is a consumer group?
  3. How does Kafka achieve load balancing among consumers?
  4. What is consumer offset?
  5. What is offset commit? Auto-commit vs manual commit
  6. What happens when a consumer fails? (rebalancing)
  7. What is consumer lag? How to monitor it?
  8. What is the difference between poll() and subscribe()?
  9. What is enable.auto.commit?
  10. What are consumer configuration parameters?
  • group.id
  • auto.offset.reset (earliest, latest, none)
  • max.poll.records
  • session.timeout.ms
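
Consumer lag, asked about above, is simple arithmetic once the offsets are known. A minimal sketch with hypothetical offsets; treating a partition with no committed offset as fully unread is an assumption, since the real behavior depends on auto.offset.reset.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag = log end offset minus the group's committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: a partition with no commit yet counts as fully unread.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end - committed, 0)
    return lag


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 1480, 2: 1510}
committed = {0: 1500, 1: 1200}
lag = consumer_lag(end_offsets, committed)
total_lag = sum(lag.values())
```

A lag that keeps growing over time is usually more informative than any single absolute value.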

Replication & Fault Tolerance

  1. What is replication in Kafka?
  2. What is replication factor?
  3. What is leader and follower?
  4. What is ISR (In-Sync Replica)?
  5. How does Kafka ensure message durability?
  6. What is min.insync.replicas?
  7. What happens when a broker fails?
  8. What is unclean leader election?

Performance & Scalability

  1. How does Kafka achieve high throughput?
  2. What is log compaction?
  3. What is retention policy in Kafka? (time-based, size-based)
  4. How to scale Kafka? (add brokers, increase partitions)
  5. What is the relationship between partitions and parallelism?
  6. What are Kafka performance tuning tips?
  7. What is the difference between Kafka and traditional message queues? (RabbitMQ, ActiveMQ)

Kafka Streams & Connect

  1. What is Kafka Streams?
  2. What is Kafka Connect?
  3. What are source and sink connectors?
  4. When to use Kafka Streams vs Kafka Connect?

Monitoring & Operations

  1. How to monitor Kafka? (JMX metrics, Kafka Manager, Burrow)
  2. Important Kafka metrics to monitor:
  • UnderReplicatedPartitions
  • OfflinePartitionsCount
  • ActiveControllerCount
  • RequestHandlerAvgIdlePercent
  3. What is Kafka MirrorMaker?

Redis (30 Questions)

Core Concepts

  1. What is Redis? Key features
  2. What makes Redis fast? (in-memory, single-threaded, efficient data structures)
  3. What are Redis data types?
  • String
  • List
  • Set
  • Sorted Set
  • Hash
  • Bitmap
  • HyperLogLog
  • Stream
  4. What are common Redis use cases? (caching, session storage, rate limiting, real-time analytics)
  5. What is the difference between Redis and Memcached?
  6. Is Redis single-threaded? How does it handle concurrent requests?

Data Structures & Commands

  1. Important String commands (SET, GET, INCR, DECR, MSET, MGET)
  2. Important List commands (LPUSH, RPUSH, LPOP, RPOP, LRANGE)
  3. Important Set commands (SADD, SMEMBERS, SINTER, SUNION, SDIFF)
  4. Important Sorted Set commands (ZADD, ZRANGE, ZRANK, ZINCRBY)
  5. Important Hash commands (HSET, HGET, HGETALL, HINCRBY)
  6. What is the time complexity of common Redis operations?
  7. What is SCAN command? Difference from KEYS
  8. What are Redis transactions? (MULTI, EXEC, DISCARD, WATCH)

Persistence

  1. What are Redis persistence mechanisms?
  • RDB (Redis Database Backup)
  • AOF (Append-Only File)
  2. What is the difference between RDB and AOF?
  3. When to use RDB vs AOF?
  4. What is hybrid persistence (RDB+AOF)?
  5. What is snapshotting in Redis?

Caching Strategies

  1. What are caching strategies?
  • Cache-Aside (Lazy Loading)
  • Write-Through
  • Write-Behind (Write-Back)
  • Read-Through
  2. What is cache eviction policy? (LRU, LFU, FIFO, Random, TTL)
  3. What is TTL (Time To Live)?
  4. How to handle cache stampede?
  5. What is cache penetration, cache breakdown, and cache avalanche?
  6. How to implement distributed locking in Redis? (SETNX, RedLock algorithm)
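
The cache-aside strategy and TTL from the list above can be sketched in a few lines. This is an illustration under assumptions: a dict stands in for Redis, the `CacheAside` class is hypothetical, and there is no protection against stampedes (many callers reloading the same expired key at once).

```python
import time


class CacheAside:
    """Cache-aside with TTL; a dict stands in for Redis."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader          # reads from the source of truth
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._store = {}               # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[1] > self._clock():
            self.hits += 1
            return entry[0]
        # Miss or expired: load from the backing store, then populate.
        self.misses += 1
        value = self._loader(key)
        self._store[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        """Call after a write so the next read reloads fresh data."""
        self._store.pop(key, None)
```

With write-through, by contrast, the write path would update the cache and the backing store together instead of invalidating.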

High Availability & Scalability

  1. What is Redis Sentinel? How does it work?
  2. What is Redis Cluster? How does it achieve scalability?
  3. What is the difference between Redis Sentinel and Redis Cluster?
  4. How does Redis Cluster handle data sharding?
  5. What is hash slot in Redis Cluster?
  6. What is split-brain problem in Redis?
  7. What is Redis replication? Master-slave architecture
  8. How to handle failover in Redis?
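
For the hash slot question, Redis Cluster maps each key to one of 16384 slots using CRC16 of the key modulo 16384. A minimal sketch, assuming Python's `binascii.crc_hqx` with an initial value of 0 as the CRC16 implementation; hash tags (the `{...}` part of a key) are deliberately left out, and `key_hash_slot` is a hypothetical name.

```python
import binascii

HASH_SLOTS = 16384


def key_hash_slot(key):
    """Return the cluster hash slot for a key: CRC16(key) mod 16384.

    binascii.crc_hqx with an initial value of 0 computes the CRC-16/XMODEM
    variant. Hash tags ({...}) are not handled in this sketch.
    """
    return binascii.crc_hqx(key.encode("utf-8"), 0) % HASH_SLOTS


slot = key_hash_slot("123456789")
```

Each master node owns a range of slots, so resharding means moving slots between nodes rather than rehashing every key.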

Performance & Monitoring

  1. How to monitor Redis? (INFO command, redis-cli, monitoring tools)
  2. Important Redis metrics:
  • Memory usage
  • Hit rate
  • Connected clients
  • Commands processed per second
  • Evicted keys
  3. How to optimize Redis performance?
  4. What is pipelining in Redis?
  5. What is Redis pub/sub?
  6. What are Redis Streams? Use cases
  7. What is the maximum size of a Redis key/value?
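
The hit rate metric listed above is typically derived from the keyspace_hits and keyspace_misses counters in the INFO output. A small sketch with hypothetical numbers; `cache_hit_rate` is an illustrative helper, and the dict mimics the shape a client library might return.

```python
def cache_hit_rate(info):
    """Hit rate from the keyspace_hits / keyspace_misses fields of INFO stats."""
    hits = info.get("keyspace_hits", 0)
    misses = info.get("keyspace_misses", 0)
    total = hits + misses
    # No lookups yet: report None rather than a misleading 0%.
    return hits / total if total else None


# Hypothetical counters in the shape of a parsed INFO response.
sample_info = {"keyspace_hits": 9200, "keyspace_misses": 800, "evicted_keys": 12}
rate = cache_hit_rate(sample_info)
```

Since both fields are cumulative counters, a dashboard would normally compute the ratio over a recent time window rather than since server start.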

CDN (Content Delivery Network) (20 Questions)

Core Concepts

  1. What is CDN? How does it work?
  2. What are the benefits of using CDN? (reduced latency, improved performance, DDoS protection, reduced bandwidth cost)
  3. What is edge server/edge location?
  4. What is origin server?
  5. What is Point of Presence (PoP)?
  6. How does CDN routing work? (DNS-based routing, Anycast)

CDN Types & Architecture

  1. What are types of CDN?
  • Push CDN
  • Pull CDN
  2. What is the difference between push and pull CDN?
  3. What is CDN caching? Cache hierarchy
  4. What is cache hit ratio?
  5. What is Time To Live (TTL) in CDN?
  6. What is cache invalidation/purging?

CDN Features

  1. What is edge computing?
  2. What is CDN load balancing?
  3. How does CDN handle dynamic content?
  4. What is CDN SSL/TLS termination?
  5. What is geo-blocking in CDN?
  6. What are CDN security features? (DDoS protection, WAF, bot protection)
  7. What is image optimization in CDN?
  8. What is compression in CDN? (Gzip, Brotli)
  9. Popular CDN providers (Cloudflare, Akamai, AWS CloudFront, Fastly, Azure CDN)
  10. What is Cloudflare? Key features
  11. What is AWS CloudFront?
  12. How to integrate CDN with your application?

Spring Boot Actuator (25 Questions)

Core Concepts

  1. What is Spring Boot Actuator?
  2. How to enable Actuator in Spring Boot?
  3. What are Actuator endpoints?
  4. What is the difference between web endpoints and JMX endpoints?

Built-in Endpoints

  1. Important Actuator endpoints:
  • /actuator/health - Application health
  • /actuator/info - Application information
  • /actuator/metrics - Application metrics
  • /actuator/env - Environment properties
  • /actuator/beans - Spring beans
  • /actuator/mappings - Request mappings
  • /actuator/loggers - Logger configuration
  • /actuator/threaddump - Thread dump
  • /actuator/heapdump - Heap dump
  • /actuator/prometheus - Prometheus metrics
  2. What is a health indicator? (disk space, database, Redis, custom)
  3. How to create custom health indicators?
  4. What is health status? (UP, DOWN, OUT_OF_SERVICE, UNKNOWN)
  5. How to expose/hide specific endpoints?
  6. What is the /info endpoint? How to add custom info?

Metrics

  1. What metrics are available in Actuator?
  • JVM metrics (memory, threads, GC)
  • HTTP metrics (request count, response time)
  • Database metrics (connection pool)
  • Custom metrics
  2. How to create custom metrics? (MeterRegistry, Counter, Gauge, Timer)
  3. What is Micrometer?
  4. How to integrate Actuator with Prometheus?
  5. What are dimensional metrics?

Security & Configuration

  1. How to secure Actuator endpoints?
  2. What is the role of Spring Security with Actuator?
  3. How to configure endpoint exposure? (management.endpoints.web.exposure.include)
  4. What is base path for Actuator? (management.endpoints.web.base-path)
  5. How to customize Actuator endpoints?
  6. What is @Endpoint annotation?

Advanced Features

  1. How to create custom Actuator endpoints?
  2. What is /auditevents endpoint?
  3. How to monitor application performance using Actuator?
  4. How to integrate Actuator with external monitoring systems? (Grafana, Prometheus, ELK)

Application Performance Monitoring (APM) (20 Questions)

Core Concepts

  1. What is APM? Why is it important?
  2. What is distributed tracing?
  3. What is a trace, span, and trace ID?
  4. What is observability vs monitoring?
  5. Three pillars of observability (Logs, Metrics, Traces)

APM Tools

  1. Popular APM tools:
  • New Relic
  • Datadog
  • AppDynamics
  • Dynatrace
  • Elastic APM
  • Jaeger
  • Zipkin
  2. What is Zipkin? How does it work?
  3. What is Jaeger? Architecture
  4. What is Spring Cloud Sleuth?
  5. How to implement distributed tracing in Spring Boot? (Sleuth + Zipkin)

Metrics & Monitoring

  1. What is Apdex score?
  2. What is response time percentiles? (p50, p95, p99)
  3. What is throughput and latency?
  4. What is error rate?
  5. How to monitor database query performance?
  6. How to identify performance bottlenecks?
  7. What is transaction tracing?
  8. What is Real User Monitoring (RUM)?
  9. What is Synthetic Monitoring?
  10. What is the difference between RUM and Synthetic Monitoring?
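
The Apdex and percentile questions above both come down to short formulas. A minimal sketch, assuming the usual Apdex definition (satisfied at or below threshold T, tolerating up to 4T) and a nearest-rank percentile; monitoring tools may interpolate percentiles differently. The sample data and function names are hypothetical.

```python
import math


def apdex(response_times, threshold):
    """Apdex = (satisfied + tolerating / 2) / total samples.

    Satisfied: time <= T. Tolerating: T < time <= 4T. Slower is frustrated.
    """
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


def percentile(response_times, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(response_times)
    rank = max(math.ceil(p / 100 * len(ordered)), 1)
    return ordered[rank - 1]


# Hypothetical response times in seconds, threshold T = 0.5 s.
times = [0.1, 0.2, 0.3, 0.4, 0.6, 0.9, 1.5, 2.5, 0.2, 0.3]
score = apdex(times, 0.5)
p50 = percentile(times, 50)
p95 = percentile(times, 95)
```

In this sample the median looks healthy while p95 is dominated by the slowest request, which is why tail percentiles are tracked alongside averages.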

API Monitoring & Management (20 Questions)

API Monitoring

  1. What is API monitoring? Why is it important?
  2. What metrics should be monitored for APIs?
  • Response time
  • Error rate
  • Request rate
  • Availability/Uptime
  • Latency
  3. How to monitor API endpoints? (health checks, synthetic monitoring)
  4. What is API uptime monitoring?
  5. What are API monitoring tools? (Postman, Runscope, Pingdom, Uptime Robot)
  6. How to implement API health checks in Spring Boot?

API Gateway Monitoring

  1. What is API Gateway?
  2. What metrics to monitor in API Gateway?
  3. How to monitor API Gateway performance?
  4. What is rate limiting in API Gateway?
  5. How to implement throttling?

API Logging & Analytics

  1. What should be logged for APIs?
  • Request/Response
  • Headers
  • Timestamps
  • User information
  • Errors
  2. How to implement structured logging for APIs?
  3. What is API analytics?
  4. How to track API usage patterns?
  5. What is request tracing?
  6. How to correlate logs across microservices? (correlation ID)

API Security Monitoring

  1. How to monitor API security?
  2. What are common API security threats? (DDoS, SQL injection, unauthorized access)
  3. How to detect API abuse?
  4. What is anomaly detection in API monitoring?

Log Management & Best Practices (20 Questions)

Logging Fundamentals

  1. What are log levels? (TRACE, DEBUG, INFO, WARN, ERROR, FATAL)
  2. When to use each log level?
  3. What is structured logging?
  4. What is the difference between structured and unstructured logs?
  5. What is JSON logging? Benefits
  6. What should be included in log messages?
  • Timestamp
  • Log level
  • Service name
  • Correlation ID
  • User ID
  • Error details
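
The fields listed above map naturally onto a JSON log line. This is a minimal sketch using Python's standard logging module; the `JsonFormatter` class, the service name, and the correlation ID value are assumptions for illustration, and a Java service would do the equivalent with a JSON encoder for Logback or Log4j2.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "correlation_id": getattr(record, "correlation_id", None),
            "user_id": getattr(record, "user_id", None),
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["error"] = self.formatException(record.exc_info)
        return json.dumps(payload)


# Build a record by hand so the output can be inspected without handlers.
record = logging.LogRecord(
    name="orders", level=logging.ERROR, pathname="example.py", lineno=1,
    msg="Payment failed for order %s", args=("A-17",), exc_info=None,
)
record.service = "order-service"      # hypothetical service name
record.correlation_id = "req-123"     # normally taken from a request header
line = JsonFormatter().format(record)
```

Because every field is a named key, a log pipeline can filter on correlation_id directly instead of relying on text patterns.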

Logging Frameworks

  1. Popular Java logging frameworks:
  • Logback
  • Log4j2
  • SLF4J (Simple Logging Facade)
  2. What is SLF4J? Why use it?
  3. What is the difference between Log4j, Log4j2, and Logback?
  4. How to configure logging in Spring Boot?
  5. What is logging pattern/layout?

Log Aggregation

  1. What is log aggregation? Why is it important?
  2. What is centralized logging?
  3. How to implement centralized logging in microservices?
  4. What is log retention policy?
  5. How to handle log rotation?
  6. What is log sampling?

Best Practices

  1. What are logging best practices?
  • Use appropriate log levels
  • Include context information
  • Avoid logging sensitive data
  • Use correlation IDs
  • Implement log sampling for high-volume systems
  2. How to avoid logging sensitive information? (passwords, credit cards, PII)
  3. How to optimize log storage costs?
  4. What is log enrichment?
  5. How to search and analyze logs efficiently?

Alerting & Incident Management (20 Questions)

Alerting Fundamentals

  1. What is alerting? Why is it important?
  2. What makes a good alert?
  3. What is alert fatigue? How to prevent it?
  4. What is the difference between alert and notification?
  5. What are alert severity levels? (Critical, High, Medium, Low)

Alert Types

  1. What are different types of alerts?
  • Threshold-based alerts
  • Anomaly detection alerts
  • Composite alerts
  2. What is threshold alerting?
  3. What is anomaly-based alerting?
  4. What is alert aggregation?
  5. What is alert deduplication?

Alert Configuration

  1. What factors to consider when setting alert thresholds?
  2. What is alert hysteresis?
  3. What is alert flapping? How to prevent it?
  4. What is alert routing?
  5. What is on-call rotation?
  6. How to prioritize alerts?
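
Hysteresis and flapping, asked about above, are easiest to explain with two thresholds: one to fire and a lower one to clear. A minimal sketch with a hypothetical `HysteresisAlert` class and made-up CPU readings:

```python
class HysteresisAlert:
    """Fire above `fire_at`, clear only once the value drops below `clear_at`."""

    def __init__(self, fire_at, clear_at):
        if clear_at >= fire_at:
            raise ValueError("clear_at must be lower than fire_at")
        self.fire_at = fire_at
        self.clear_at = clear_at
        self.firing = False

    def observe(self, value):
        """Update state with a new measurement and return whether the alert is firing."""
        if not self.firing and value > self.fire_at:
            self.firing = True
        elif self.firing and value < self.clear_at:
            self.firing = False
        return self.firing


# A CPU percentage hovering around a single 80% threshold would flap;
# with a separate 70% clear level the alert changes state only twice.
alert = HysteresisAlert(fire_at=80, clear_at=70)
states = [alert.observe(v) for v in [75, 82, 79, 81, 78, 69, 75]]
```

Requiring a condition to hold for a minimum duration before firing is a complementary way to reduce flapping.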

Incident Management

  1. What is incident management process?
  2. What is incident severity classification?
  3. What is MTTR (Mean Time To Repair)?
  4. What is MTTD (Mean Time To Detect)?
  5. What is MTTA (Mean Time To Acknowledge)?
  6. What are incident management tools? (PagerDuty, Opsgenie, VictorOps)
  7. What is incident postmortem? Why is it important?
  8. What is runbook/playbook?

Scenario-Based Questions (40 Questions)

Performance Issues

  1. Your application response time suddenly increased. How would you troubleshoot?
  2. How would you identify if the issue is in application, database, or network?
  3. CPU usage is at 100%. How would you investigate?
  4. Memory usage is continuously growing. How would you detect memory leaks?
  5. Database queries are slow. How would you optimize?
  6. How would you handle a sudden spike in traffic?
  7. Application is timing out. How would you debug?

Monitoring & Observability

  1. How would you set up monitoring for a new microservice?
  2. What metrics would you monitor for a REST API?
  3. How would you monitor database performance?
  4. How would you implement distributed tracing across 10 microservices?
  5. How would you correlate logs across multiple services?
  6. How would you monitor Kafka consumer lag?
  7. How would you detect if a microservice is down?
  8. How would you monitor Redis cache hit rate?

Alerting & Incident Response

  1. You received an alert about high error rate. What steps would you take?
  2. How would you configure alerts to avoid false positives?
  3. Multiple alerts are firing. How would you prioritize?
  4. A critical service is down at 3 AM. Walk through your incident response process
  5. How would you implement on-call rotation for your team?
  6. How would you conduct a postmortem after an incident?

ELK Stack Scenarios

  1. Elasticsearch cluster is slow. How would you optimize?
  2. How would you handle log ingestion of 1TB/day?
  3. Elasticsearch nodes are running out of memory. What would you do?
  4. How would you search for specific error messages across millions of logs?
  5. How would you implement log retention policy for cost optimization?
  6. Kibana dashboards are loading slowly. How would you troubleshoot?

Kafka Scenarios

  1. Kafka consumer lag is increasing. How would you address it?
  2. A Kafka broker went down. What happens?
  3. How would you handle Kafka rebalancing issues?
  4. Messages are being duplicated. How would you ensure exactly-once delivery?
  5. How would you migrate Kafka cluster without downtime?
  6. How would you scale Kafka to handle 10x traffic?

Redis Scenarios

  1. Redis is running out of memory. What would you do?
  2. Cache hit rate is very low. How would you improve it?
  3. How would you handle cache stampede during peak traffic?
  4. Redis master went down. How does failover work?
  5. How would you implement rate limiting using Redis?
  6. How would you migrate from single Redis instance to Redis Cluster?

Prometheus & Grafana Scenarios

  1. Prometheus is consuming too much storage. How would you optimize?
  2. How would you monitor multiple Kubernetes clusters with Prometheus?
  3. Grafana dashboard is not showing recent data. What could be wrong?
  4. How would you create a dashboard for database performance monitoring?
  5. How would you set up alerts for API latency > 500ms?

CDN & API Gateway Scenarios

  1. CDN cache hit rate is low. How would you improve it?
  2. Static assets are not being cached. How would you debug?
  3. How would you handle CDN cache invalidation for a critical update?
  4. API Gateway is becoming a bottleneck. How would you scale?
  5. How would you implement rate limiting at API Gateway level?

System Design with Monitoring

  1. Design a monitoring system for an e-commerce application
  2. How would you monitor a microservices-based system with 50+ services?
  3. Design an alerting strategy for a payment processing system
  4. How would you implement observability in a serverless architecture?
  5. Design a logging strategy for a multi-region deployment

Best Practices & Guidelines (25 Questions)

Monitoring Best Practices

  1. What are the key principles of effective monitoring?
  • Monitor what matters
  • Keep it simple
  • Avoid alert fatigue
  • Use meaningful metrics
  2. What is the USE method? (Utilization, Saturation, Errors)
  3. What is the RED method? (Rate, Errors, Duration)
  4. What metrics should be monitored at different layers?
  • Application layer
  • Infrastructure layer
  • Network layer
  • Database layer
  5. How to establish SLOs (Service Level Objectives)?
  6. What is the difference between SLI, SLO, and SLA?

Logging Best Practices

  1. What are logging best practices in microservices?
  2. How to implement correlation across distributed systems?
  3. What should never be logged? (passwords, tokens, PII, credit cards)
  4. How to balance between detailed logging and performance?
  5. What is the cost of excessive logging?

Alerting Best Practices

  1. What makes an actionable alert?
  2. How many alerts are too many?
  3. What is alert-to-noise ratio?
  4. Should you alert on symptoms or causes?
  5. What is the difference between alerts and notifications?

Performance Best Practices

  1. What are performance monitoring best practices?
  2. How to establish baseline metrics?
  3. What is capacity planning?
  4. How to perform load testing with monitoring?
  5. What is chaos engineering? How does monitoring help?

Security & Compliance

  1. How to ensure sensitive data is not exposed in logs?
  2. What are compliance requirements for log retention? (GDPR, HIPAA)
  3. How to implement audit logging?
  4. How to secure monitoring endpoints?
  5. What access controls should be in place for monitoring systems?

Tools Comparison (10 Questions)

  1. Prometheus vs InfluxDB - When to use which?
  • Prometheus: Pull-based, optimized for metrics/monitoring, strong alerting, better for Kubernetes, PromQL
  • InfluxDB: Push-based, general-purpose time-series DB, better for IoT/sensor data, InfluxQL/Flux, built-in data retention
  • Use Prometheus for infrastructure monitoring, InfluxDB for application analytics
  2. ELK Stack vs Splunk - Pros and cons
  • ELK Stack:
    • Pros: Open-source, cost-effective, flexible, large community
    • Cons: Complex setup, resource-intensive, requires maintenance
  • Splunk:
    • Pros: Enterprise features, powerful analytics, better support, easier setup
    • Cons: Expensive licensing, cost scales with data volume
  • Use ELK for cost-sensitive projects, Splunk for enterprise with budget
  3. Grafana vs Kibana - Key differences
  • Grafana: Multi-source visualization, better for metrics/time-series, cleaner dashboards, alerting
  • Kibana: Tightly integrated with Elasticsearch, better for logs, built-in analytics, Elastic ecosystem
  • Use Grafana for metrics dashboards, Kibana for log analysis
  4. Kafka vs RabbitMQ - Use cases
  • Kafka:
    • High throughput, distributed streaming, log aggregation, event sourcing
    • Durable, replay capability, horizontal scaling
  • RabbitMQ:
    • Traditional message queue, complex routing, low latency, easier setup
    • Better for request-reply patterns
  • Use Kafka for event streaming/big data, RabbitMQ for traditional messaging
  5. Redis vs Memcached - Key differences
  • Redis:
    • Multiple data structures, persistence, pub/sub, clustering, Lua scripting
    • Single-threaded command execution, feature-rich
  • Memcached:
    • Simple key-value only, multi-threaded, no persistence
    • Slightly faster for simple caching
  • Use Redis for complex use cases, Memcached for simple distributed caching
  6. Logstash vs Fluentd - Comparison
  • Logstash:
    • Elastic ecosystem, rich plugins, Grok patterns, Java-based (resource heavy)
  • Fluentd:
    • Lightweight (Ruby/C), better performance, Cloud Native, CNCF project
    • JSON native, unified logging layer
  • Use Logstash with ELK Stack, Fluentd for cloud-native/Kubernetes
  7. Jaeger vs Zipkin - Distributed tracing comparison
  • Jaeger:
    • Uber-developed, CNCF project, better for Kubernetes
    • Adaptive sampling, hot-path support
  • Zipkin:
    • Twitter-developed, simpler setup, more mature
    • Better documentation, wider adoption
  • Both are good choices; choose based on ecosystem fit
  8. New Relic vs Datadog - APM comparison
  • New Relic:
    • Strong APM focus, easier learning curve, better for application monitoring
    • Pricing based on users and data ingested (older plans were per-host)
  • Datadog:
    • Better infrastructure monitoring, more integrations, real-time analytics
    • Mostly per-host pricing with extra charges for custom metrics; can be expensive
  • Choose based on primary use case (application vs infrastructure focus)
  9. CloudWatch vs Prometheus for AWS - When to use which?
  • CloudWatch:
    • Native AWS integration, no setup required, managed service
    • Limited retention, AWS-specific
  • Prometheus:
    • Open-source, flexible, better query language, cross-cloud
    • Self-managed, requires setup
  • Use CloudWatch for AWS-only, Prometheus for multi-cloud/detailed metrics
  10. Sentry vs ELK for error tracking - Comparison
  • Sentry:
    • Specialized error tracking, better error grouping, release tracking
    • Developer-friendly, issue assignment
  • ELK:
    • General-purpose logging, full-text search, broader use cases
    • More complex but more flexible
  • Use Sentry for application error tracking, ELK for comprehensive logging

Additional Advanced Topics (15 Questions)

Observability as Code

  1. What is Observability as Code?
  • Defining monitoring, logging, and alerting configuration as code
  • Version control, peer review, automated deployment
  • Infrastructure as Code for observability
  2. What are the benefits of GitOps for monitoring?
  • Version control for dashboards and alerts
  • Reproducible environments
  • Easy rollback and audit trail

Service Mesh Observability

  1. What is service mesh? How does it help observability?
  • Istio, Linkerd, Consul Connect
  • Automatic distributed tracing
  • Standardized metrics collection
  • Traffic visibility without code changes
  2. What metrics does a service mesh provide?
  • Request success rates
  • Latency distribution
  • Service dependencies
  • Circuit breaker stats

Cost Optimization

  1. How to optimize monitoring costs?
  • Sampling high-volume metrics
  • Data retention policies
  • Log level filtering
  • Metric aggregation
  • Use tiered storage (hot/warm/cold)
  2. What is metric cardinality explosion? How to prevent it?
  • Too many unique label combinations
  • Increases storage and query costs
  • Prevention: Limit label values, avoid unbounded labels, use label guidelines
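
The cardinality explosion described above follows from simple multiplication: the worst-case number of time series for one metric is the product of the distinct values of each label. A small sketch with hypothetical label sets; `max_series` is an illustrative helper.

```python
from math import prod


def max_series(label_values):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical label sets for one HTTP request counter.
bounded = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status": ["2xx", "3xx", "4xx", "5xx"],
    "service": [f"svc-{i}" for i in range(20)],
}
bounded_series = max_series(bounded)

# Adding an unbounded label such as a user ID multiplies the total.
unbounded = dict(bounded, user_id=[f"user-{i}" for i in range(10_000)])
unbounded_series = max_series(unbounded)
```

Here a single extra label turns 320 possible series into 3.2 million, which is why user IDs, request IDs, and raw URLs are usually kept out of labels.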

Modern Observability Patterns

  1. What is OpenTelemetry?
  • Vendor-neutral observability framework
  • Unified APIs for traces, metrics, logs
  • Auto-instrumentation support
  • CNCF project
  2. What is eBPF in observability?
  • Extended Berkeley Packet Filter
  • Kernel-level observability without agents
  • Low overhead monitoring
  • Tools: Pixie, Cilium
  3. What is continuous profiling?
  • Always-on performance profiling
  • Production-safe profiling
  • Identify performance regressions
  • Tools: Pyroscope, Parca

SRE & Reliability

  1. What is SRE (Site Reliability Engineering)?
  • Applies software engineering to operations
  • Focus on reliability, scalability, automation
  • Error budgets and SLOs
  2. What is an error budget?
  • Acceptable downtime based on SLO
  • Balance between reliability and feature velocity
  • Example: a 99.9% availability SLO allows about 43 minutes of downtime in a 30-day month
  3. What are the four golden signals of SRE?
  • Latency: Time to serve requests
  • Traffic: Demand on system
  • Errors: Rate of failed requests
  • Saturation: Resource utilization
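
The error budget example above (99.9% over a month) can be checked with a short calculation. A minimal sketch, assuming a 30-day window and an availability-based SLO; the function names are hypothetical.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime in minutes for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


# 30 days = 43,200 minutes; 0.1% of that is about 43.2 minutes.
monthly_budget = error_budget_minutes(99.9)
half_left = budget_remaining(99.9, downtime_minutes=21.6)
```

Teams often tie release decisions to the remaining fraction, slowing feature rollouts when the budget is nearly spent.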

Cloud-Native Monitoring

  1. How to monitor Kubernetes clusters?
  • Prometheus Operator
  • kube-state-metrics
  • Node exporter
  • cAdvisor for container metrics
  • Grafana dashboards
  2. What is container monitoring? Key metrics
  • CPU and memory usage per container
  • Container restart count
  • Network I/O
  • Disk I/O
  • Tools: cAdvisor, Datadog, New Relic
  3. How to monitor serverless applications?
  • Cold start duration
  • Invocation count and errors
  • Duration and memory usage
  • CloudWatch for AWS Lambda
  • Distributed tracing challenges

Real-World Integration Patterns (10 Questions)

  1. How to integrate Prometheus with Spring Boot microservices?
  • Add Micrometer dependency
  • Enable Actuator with Prometheus endpoint
  • Configure Prometheus scraping
  • Create Grafana dashboards
  1. How to set up centralized logging for microservices?
  • Filebeat on each service → Logstash → Elasticsearch → Kibana
  • Add correlation ID to all logs
  • Structured JSON logging
  • Log aggregation pattern
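The correlation ID and structured JSON logging points can be sketched together. This is a language-agnostic idea shown with Python's standard `logging` module; the field names, the `order-service` name, and the `handle_request` helper are assumptions made for illustration, not a prescribed schema.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(service: str, stream) -> logging.Logger:
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id=None) -> None:
    """Reuse the caller's correlation ID if present, otherwise create one."""
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("order received")
    finally:
        correlation_id.reset(token)


buffer = io.StringIO()
handle_request(build_logger("order-service", buffer), incoming_id="req-42")
print(buffer.getvalue().strip())
```

Because every line is a JSON object carrying the same `correlation_id`, a log shipper can index the field and a single search returns the full path of one request across services.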
  1. How to implement health checks across microservices?
  • Liveness probes (is service running?)
  • Readiness probes (can service handle traffic?)
  • Custom health indicators
  • Aggregate health status
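Aggregating health status usually means "worst status wins": the service reports DOWN if any required component is DOWN. The sketch below shows that rule in isolation; the component names and check functions are hypothetical, and it is not a reproduction of any framework's health API.

```python
from typing import Callable, Dict


def aggregate_health(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each component check; overall status is UP only if all are UP."""
    components = {}
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            # A check that throws is treated as a failed check.
            healthy = False
        components[name] = "UP" if healthy else "DOWN"
    overall = "UP" if all(s == "UP" for s in components.values()) else "DOWN"
    return {"status": overall, "components": components}


# Hypothetical checks standing in for real database and cache pings.
print(aggregate_health({"database": lambda: True, "cache": lambda: False}))
```

A readiness probe would typically be backed by an aggregate like this, while a liveness probe stays deliberately cheap and avoids calling dependencies.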
  1. How to monitor API Gateway (Kong/AWS API Gateway)?
  • Request/response metrics
  • Rate limiting metrics
  • Authentication success/failure
  • Backend service health
  • Integration with Prometheus/CloudWatch
  1. How to integrate Kafka with monitoring systems?
  • JMX Exporter for Prometheus
  • Consumer lag monitoring (Burrow)
  • Kafka Manager (now CMAK) or AKHQ for a UI
  • Alert on lag, under-replicated partitions
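Consumer lag per partition is the log end offset minus the consumer group's committed offset. A minimal sketch of that arithmetic, using made-up offsets; treating a missing committed offset as 0 is an assumption of this example, since real behavior depends on the consumer's offset reset policy.

```python
from typing import Dict


def partition_lag(end_offsets: Dict[int, int],
                  committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition: log end offset minus committed offset, floored at 0."""
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: no committed offset yet means the whole partition is unread.
        lag[partition] = max(end - committed.get(partition, 0), 0)
    return lag


def should_alert(lag: Dict[int, int], threshold: int) -> bool:
    """Alert when total lag across partitions exceeds the threshold."""
    return sum(lag.values()) > threshold


# Illustrative offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 1480, 2: 1510}
committed = {0: 1500, 1: 1200, 2: 1010}
lag = partition_lag(end_offsets, committed)
print(lag, sum(lag.values()), should_alert(lag, threshold=500))
```

In practice an alert on lag that keeps growing over several minutes is usually more useful than a fixed threshold, because a steady non-zero lag can be normal for batch-style consumers.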
  1. How to monitor database connections in Spring Boot?
  • HikariCP metrics via Actuator
  • Monitor active, idle, pending connections
  • Connection pool saturation alerts
  • Query performance with slow query logs
  1. How to implement circuit breaker monitoring?
  • Resilience4j with Micrometer
  • Monitor state transitions (closed/open/half-open)
  • Success/failure rates
  • Visualize in Grafana
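To show what "monitor state transitions" means, here is a deliberately small circuit breaker that counts transitions and call outcomes. It is a teaching sketch, not Resilience4j's implementation or API; the thresholds, state names, and counter keys are assumptions.

```python
import time
from collections import Counter


class CircuitBreaker:
    """Minimal breaker that records state transitions and call outcomes."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.transitions = Counter()  # (from_state, to_state) -> count
        self.outcomes = Counter()     # "success" / "failure" / "rejected"

    def _move(self, new_state):
        self.transitions[(self.state, new_state)] += 1
        self.state = new_state

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self._move("half_open")  # allow one trial call
            else:
                self.outcomes["rejected"] += 1
                raise RuntimeError("circuit open")
        try:
            result = fn()
        except Exception:
            self.outcomes["failure"] += 1
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
                self._move("open")
            raise
        self.outcomes["success"] += 1
        self.failures = 0
        if self.state == "half_open":
            self._move("closed")
        return result
```

The `transitions` and `outcomes` counters are the values one would export as metrics: a spike in closed-to-open transitions or in rejected calls is the signal to alert on and to plot on a dashboard.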
  1. How to trace requests across API Gateway → Microservices → Database?
  • Spring Cloud Sleuth (or Micrometer Tracing on Spring Boot 3) for trace ID generation
  • Propagate trace context in HTTP headers
  • Zipkin/Jaeger for trace collection
  • Visualize complete request flow
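Propagating trace context means each hop keeps the trace ID and issues a new span ID before calling the next service. The sketch below uses the W3C Trace Context `traceparent` header layout as one example format; setups built on Sleuth and Zipkin often use B3 headers instead, and the helper names here are illustrative.

```python
import re
import secrets

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace with the sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def outgoing_headers(incoming: dict) -> dict:
    """Keep the incoming trace ID, but use a fresh span ID for the next hop."""
    match = TRACEPARENT.match(incoming.get("traceparent", ""))
    if match is None or set(match.group(1)) == {"0"}:
        # Missing, malformed, or all-zero trace ID: start a new trace.
        return {"traceparent": new_traceparent()}
    trace_id, _parent_span, flags = match.groups()
    return {"traceparent": f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"}


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(outgoing_headers(incoming))
```

Because the trace ID is unchanged across the gateway, every service, and the database client span, the tracing backend can stitch the spans into one end-to-end request view.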
  1. How to implement custom business metrics?
  • MeterRegistry in Spring Boot
  • Counter for events (orders, signups)
  • Timer for operations
  • Gauge for current state
  • Export to Prometheus
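The three meter types map onto simple ideas: a counter only goes up, a gauge samples a current value on demand, and a timer records a count and a total duration. The sketch below illustrates those semantics and renders them in a simplified Prometheus-style text form. It is not Micrometer's `MeterRegistry` API; the class, method, and metric names are assumptions for the example.

```python
import time
from contextlib import contextmanager


class Metrics:
    """Tiny in-memory registry illustrating counter, gauge, and timer semantics."""

    def __init__(self):
        self.counters = {}
        self.gauges = {}
        self.timers = {}  # name -> [call_count, total_seconds]

    def increment(self, name, amount=1.0):
        if amount < 0:
            raise ValueError("counters only go up")
        self.counters[name] = self.counters.get(name, 0.0) + amount

    def gauge(self, name, supplier):
        # The supplier is called at render time, so the value is always current.
        self.gauges[name] = supplier

    @contextmanager
    def timed(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            entry = self.timers.setdefault(name, [0, 0.0])
            entry[0] += 1
            entry[1] += time.perf_counter() - start

    def render(self) -> str:
        lines = []
        for name in sorted(self.counters):
            lines += [f"# TYPE {name} counter", f"{name} {self.counters[name]}"]
        for name in sorted(self.gauges):
            lines += [f"# TYPE {name} gauge", f"{name} {float(self.gauges[name]())}"]
        for name in sorted(self.timers):
            count, total = self.timers[name]
            lines += [f"# TYPE {name} summary",
                      f"{name}_count {count}", f"{name}_sum {total}"]
        return "\n".join(lines) + "\n"


metrics = Metrics()
pending_orders = ["order-1", "order-2"]       # hypothetical in-memory queue
metrics.increment("orders_created_total")
metrics.gauge("orders_pending", lambda: len(pending_orders))
with metrics.timed("checkout_seconds"):
    pass                                      # business operation would run here
print(metrics.render())
```

In a Spring Boot service the registry, naming, and export are handled by Micrometer and Actuator; the point of the sketch is only which meter type fits which kind of business signal.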
  1. How to monitor scheduled jobs/batch processes?
  • Job execution time
  • Success/failure rate
  • Last successful run timestamp
  • Dead letter queue monitoring
  • Alert on job failures
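The "last successful run timestamp" pattern catches jobs that silently stop running, which failure counters alone cannot detect. A minimal sketch, assuming the job records a Unix timestamp on success; the grace factor of 1.5 is an arbitrary example value.

```python
import time


def seconds_since_success(last_success_epoch: float, now=None) -> float:
    """Age of the last successful run, in seconds."""
    current = time.time() if now is None else now
    return current - last_success_epoch


def is_stale(last_success_epoch: float, expected_interval_s: float,
             now=None, grace: float = 1.5) -> bool:
    """True when the job has not succeeded within its interval plus some slack."""
    return seconds_since_success(last_success_epoch, now) > expected_interval_s * grace


# An hourly job whose last success was 100 minutes ago (illustrative numbers).
last_success = 1_700_000_000
print(is_stale(last_success, expected_interval_s=3600, now=last_success + 6000))
```

Exposing the timestamp as a gauge lets the alerting system evaluate the same comparison, so the alert fires even if the scheduler itself is down.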

Troubleshooting Checklist (10 Questions)

  1. Application is slow - Where to start?
  • Check application metrics (response time, throughput)
  • Review recent deployments/changes
  • Check resource utilization (CPU, memory, disk)
  • Analyze slow queries in database
  • Check external service dependencies
  • Review logs for errors/warnings
  2. High memory usage - Investigation steps
  • Take heap dump (jmap, Actuator)
  • Analyze with tools (MAT, VisualVM)
  • Check for memory leaks
  • Review garbage collection logs
  • Check cache sizes
  • Monitor memory growth over time
  3. Database queries timing out - Debug approach
  • Enable slow query log
  • Check query execution plans (EXPLAIN)
  • Look for missing indexes
  • Check database connection pool
  • Review lock contention
  • Check database resource utilization
  4. Microservice not responding - Troubleshooting
  • Check health endpoint
  • Review application logs
  • Check resource limits (CPU, memory)
  • Verify network connectivity
  • Check dependent services
  • Review recent deployments
  5. Redis cache misses increasing - Investigation
  • Check cache hit/miss ratio
  • Verify TTL settings
  • Check memory usage and eviction
  • Review cache key patterns
  • Look for cache invalidation issues
  • Check client connection issues
  6. Kafka consumer lag growing - Resolution steps
  • Check consumer processing time
  • Verify partition assignment
  • Scale consumers (add instances)
  • Optimize consumer batch size
  • Check for slow downstream dependencies
  • Review consumer configuration
  7. Elasticsearch cluster yellow/red status - Fix
  • Check unassigned shards
  • Verify replica settings
  • Check disk space on nodes
  • Review cluster allocation settings
  • Check for node failures
  • Rebalance shards if needed
  8. Prometheus scraping failures - Troubleshooting
  • Verify target is reachable
  • Check firewall/network rules
  • Verify metrics endpoint is exposed
  • Check Prometheus logs
  • Verify service discovery config
  • Test metrics endpoint manually
  9. Grafana dashboard not updating - Debug
  • Check data source connection
  • Verify time range selection
  • Check query syntax
  • Review Prometheus/data source availability
  • Check dashboard refresh settings
  • Look for query errors in browser console
  10. API Gateway returning 5xx errors - Investigation
  • Check gateway logs
  • Verify backend service health
  • Check timeout configurations
  • Review rate limiting rules
  • Check authentication/authorization
  • Verify routing configuration


Interview Preparation Tips

Common Interview Patterns

Pattern 1: Troubleshooting Scenarios

  • Always start with data/metrics
  • Follow systematic approach
  • Consider recent changes
  • Think about dependencies
  • Propose monitoring improvements

Pattern 2: System Design Questions

  • Define requirements first
  • Consider scale and load
  • Plan for failure scenarios
  • Include monitoring from start
  • Discuss trade-offs

Pattern 3: Tool Selection

  • Understand use case
  • Consider scale and cost
  • Think about team expertise
  • Integration with existing tools
  • Open-source vs commercial

Key Concepts to Master

  1. Metrics Collection: Pull vs Push, sampling, cardinality
  2. Log Aggregation: Centralization, parsing, storage, retention
  3. Distributed Tracing: Correlation, context propagation, sampling
  4. Alerting: Thresholds, alert fatigue, actionable alerts
  5. Scalability: Horizontal scaling, partitioning, caching
  6. High Availability: Replication, failover, disaster recovery

Quick Reference Metrics

Application:

  • Response time (p50, p95, p99)
  • Request rate (req/sec)
  • Error rate (%)
  • Active users/connections
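For the application metrics above, it helps to know how a percentile and an error rate are derived from raw samples. The sketch below uses the nearest-rank method on an in-memory list, which is an assumption of this example; monitoring systems usually estimate percentiles from histogram buckets rather than from every raw sample.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate_percent(errors: int, total: int) -> float:
    """Share of failed requests, as a percentage."""
    return 100.0 * errors / total if total else 0.0


# Illustrative latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95), percentile(latencies_ms, 99))
print(error_rate_percent(errors=3, total=200))
```

The gap between p50 and p99 is often more informative than the average, because a small number of slow requests can hide behind a healthy-looking mean.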

Infrastructure:

  • CPU utilization (%)
  • Memory usage (%)
  • Disk I/O (IOPS, throughput)
  • Network I/O (bytes/sec)

Database:

  • Query execution time
  • Connection pool usage
  • Slow query count
  • Replication lag

Cache (Redis):

  • Hit rate (%)
  • Memory usage
  • Evicted keys
  • Connected clients
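Redis reports cache hits and misses as cumulative counters (`keyspace_hits` and `keyspace_misses` in the `INFO stats` output), so a current hit rate is best computed from the difference between two snapshots. A minimal sketch with made-up snapshot values:

```python
from typing import Dict, Optional


def hit_rate_percent(hits: int, misses: int) -> Optional[float]:
    """Hit rate as a percentage, or None when there were no lookups."""
    total = hits + misses
    if total == 0:
        return None
    return 100.0 * hits / total


def interval_hit_rate(earlier: Dict[str, int], later: Dict[str, int]) -> Optional[float]:
    """Hit rate over the interval between two snapshots of the cumulative counters."""
    hits = later["keyspace_hits"] - earlier["keyspace_hits"]
    misses = later["keyspace_misses"] - earlier["keyspace_misses"]
    return hit_rate_percent(hits, misses)


# Illustrative snapshots taken some time apart.
earlier = {"keyspace_hits": 9_000, "keyspace_misses": 1_000}
later = {"keyspace_hits": 9_600, "keyspace_misses": 1_400}
print(hit_rate_percent(later["keyspace_hits"], later["keyspace_misses"]))
print(interval_hit_rate(earlier, later))
```

In this example the lifetime hit rate still looks healthy while the recent interval has dropped, which is the pattern to look for when cache misses start increasing.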

Message Queue (Kafka):

  • Consumer lag
  • Message rate
  • Under-replicated partitions
  • Broker availability